Take-home_Ex02

Author

NeoYX

Published

May 15, 2023

Modified

May 28, 2023

Note

Edge data should be organised as such: (can use dplyr methods)

First column: Source id (FK to Node second column) - compulsory

Second column: Target id (FK to Node second column) - compulsory

Node data

First column: ID (contains all the distinct values of source and target in Edge data) - compulsory

  • Nodes present in edge data must exists in ID of node data, must not have missing in node ID.

Second column: Label (only need if Id are all integers)

Warning

Try not to use R built-in NA/NULL function. Manually type “unknown’ / ‘missing’ as a value instead.

Vast Challenge 2023 Mini Challenge 2 (Subtask: 1)

In this challenge, my group and I seek to use visual analytics to identify temporal patterns for individual entities and between entities using the knowledge graph the FishEye Organisation has provided us with. We will also be categorising the type of business relationship patterns found. The visualisation outputs:

1) Interactive network graph with nodes coloured by varying range of betweenness, in-deg and out-deg centrality scores to spot the companies with high / low centrality scores easily.

2) Heatmap for top n companies (shipping and receiving) over the years 2028 to 2034 allows us to see all trading activities at one glance. We could try to spot companies with short trading duration.

3) Using two top receiving companies as an example, iteractive time-series charts (line chart) are plotted with four of their supplier companies to spot trends and anomalies. We could try to spot receiving companies who keeps changing suppliers with this.

4) Coordinated and interactive scatterplots of ‘number of interactions’ and ‘partnership duration’ between each unique pair of ship-receive companies were plotted. We could try to spot outliers and investigate them!

1 About the dataset

1.1 Data dictionary

Node Attributes:

  • id -- Name of the company that originated (or received) the shipment

  • shpcountry -- Country the company most often associated with when shipping

  • rcvcountry -- Country the company most often associated with when receiving

  • dataset -- Always ‘MC2’

Edge Attributes:

  • arrivaldate -- Date the shipment arrived at port in YYYY-MM-DD format.

  • hscode -- Harmonized System code for the shipment. Can be joined with the hscodes table to get additional details.

  • valueofgoods_omu -- Customs-declared value of the total shipment, in Oceanus

  • Monetary Units (OMU)

  • volumeteu -- The volume of the shipment in ‘Twenty-foot equivalent units’, roughly how many 20-foot standard containers would be required. (Actual number of containers may have been different as there are 20ft and 40ft standard containers and tankers that do not use containers)

  • weightkg -- The weight of the shipment in kilograms (if known)

  • dataset -- Always ‘MC2’

  • type -- Always ‘shipment’ for MC2

  • generated_by -- Name of the program that generated the edge. (Only found on ‘bundle’ records.)

1.2 Importing the datasets

Import libraries

The new libraries used today are :

  • jsonlite to import json file
Show the code
pacman::p_load(jsonlite, igraph, tidygraph, ggraph,
               visNetwork, lubridate, clock,
               tidyverse, graphlayouts,knitr,plotly, 
               ggHoriPlot, ggthemes,hrbrthemes,treemap,patchwork, ggiraph,
               ggstatsplot)
Show the code
MC2 <- jsonlite::fromJSON("C:/yixin-neo/ISSS608-VAA/Project/data/mc2_challenge_graph.json")

Pull out the nodes and edge data and save them as tibble data frames.

Show the code
MC2_nodes <- as_tibble(MC2$nodes) %>% 
  select(id,shpcountry,rcvcountry)
Rows: 34,576
Columns: 3
$ id         <chr> "AquaDelight Inc and Son's", "BaringoAmerica Marine Ges.m.b…
$ shpcountry <chr> "Polarinda", NA, "Oceanus", NA, "Oceanus", "Kondanovia", NA…
$ rcvcountry <chr> "Oceanus", NA, "Oceanus", NA, "Oceanus", "Utoporiana", NA, …

Rearranging the columns in edge file as we require source and target columns to be the first two columns.

Show the code
MC2_edges <- as_tibble(MC2$links) %>% 
  select(source,target,arrivaldate,hscode,valueofgoods_omu,volumeteu,weightkg,valueofgoodsusd)  
Rows: 5,464,378
Columns: 8
$ source           <chr> "AquaDelight Inc and Son's", "AquaDelight Inc and Son…
$ target           <chr> "BaringoAmerica Marine Ges.m.b.H.", "BaringoAmerica M…
$ arrivaldate      <chr> "2034-02-12", "2034-03-13", "2028-02-07", "2028-02-23…
$ hscode           <chr> "630630", "630630", "470710", "470710", "470710", "47…
$ valueofgoods_omu <dbl> 141015, 141015, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ volumeteu        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ weightkg         <int> 4780, 6125, 10855, 11250, 11165, 11290, 9000, 19490, …
$ valueofgoodsusd  <dbl> NA, NA, NA, NA, NA, NA, 87110, 188140, NA, 221110, 58…

2 Data cleaning

2.1 Lets check for duplicates

For MC2_nodes dataframe:

There are no duplicated nodes, which is great.

Show the code
# check for any duplicates
any(duplicated(MC2_nodes))

For MC2_edges dataframe:

There are about 155291 records (2% out of total records) that are duplicated.

Show the code
#duplicated only
# print(any(duplicated(MC2_edges)))
MC2_edges_dup <- MC2_edges[duplicated(MC2_edges), ]
print(nrow(MC2_edges_dup))
[1] 155291

We will drop the duplicates.

Show the code
# Drop duplicate rows from the dataframe
MC2_edges_no_dup <- MC2_edges[!duplicated(MC2_edges), ]

2.2 Check for null values

Check whether each column in MC2_nodes and MC2_edges contains null and prints the percentage of null for each column.

For MC2_nodes dataframe:

There are no null values in the id column of Nodes file, which is great.

Show the code
# Check for null values in each column
null_counts_nodes <- sapply(MC2_nodes, function(x) sum(is.null(x) | is.na(x)))

# Calculate the percentage of null values for each column
null_percentages_nodes <- null_counts_nodes / nrow(MC2_nodes) * 100

# Display the results
#print(null_percentages)

knitr::kable(null_percentages_nodes, "simple", col.names = c("Null Percentage"))
Null Percenta ge
id 0.00000
shpcountry 64.66624
rcvcountry 8.41335

For MC2_edges dataframe:

As there are a lot zeros inside MC2_edges$volumteu col, we will consider 0 as equivalent to null values.

We can see that the columns valueofgoods_omu and volumeteu are mainly null. valueofgoodusd column contains more than 50% null values. There are 4 records of source with 0 as value, but 0 is their unique identifier so we do not consider 0 as null in source column. It means to say that only source, target, arrivaldate, hscode and weight columns will be helpful in our analysis.

Show the code
# Check for null values in each column
null_counts <- sapply(MC2_edges_no_dup, function(x) sum(is.null(x) | is.na(x) | x==0))

# Calculate the percentage of null values for each column
null_percentages <- null_counts / nrow(MC2_edges_no_dup) * 100

# Display the results
#print(null_percentages)

knitr::kable(null_percentages, "simple", col.names = c('Percentage null'))
Percentage null
source 0.0000753
target 0.0000000
arrivaldate 0.0000000
hscode 0.0000000
valueofgoods_omu 99.9947072
volumeteu 85.2925183
weightkg 0.9012849
valueofgoodsusd 54.4461223

We will be dropping the valueofgoods_omu , valueofgoodusdand volumeteu columns from our dataframe by selecting only the columns that we need.

Show the code
MC2_edges_no_dup <- MC2_edges_no_dup %>% select('source','target', 'arrivaldate', 'hscode','weightkg')

2.3 Check on the HScodes

Check the unique number of hscodes in the dataset. There are 4761 unique HScodes, however many are not related to fishing related activties.

Show the code
# Find the number of unique values in hscode
length(unique(MC2_edges_no_dup$hscode))
[1] 4761

With reference two websites on “Harmonized System of Nomenclature” namely, World Custom Organisation Harmonized System codes and connect2India , we will filter for records that have HScodes starting with 1604xx, 1605xx and 301xxx to 308xxx as they refer to seafood commodities, thus removing many other transactions like ‘television’, ‘steel parts’ etc… This will help to filter away the noises and help us to focus on trading activities related to the fishing industry.

Show the code
mc2_seafood_edges<- MC2_edges_no_dup[grepl('^1605|^1604|^301|^302|^303|^304|^305|^306|^307|^308', MC2_edges_no_dup$hscode), ]

2.4 Preparation of Edges

We will perform a group by ‘source’, ‘target’ and ‘arrivaldate’ and aggregate the total count of interactions, ‘Weight’, between each pair of companies. At this moment, we should not be filtering the records because we would like to calculate the network centrality scores first before zooming into records of interest. As we are interested to see whether there are patterns of self-loops, we will not be removing any company shipping to itself.

Show the code
#unique(mc2_seafood_edges$hscode)
#unique(mc2_seafood_edges$source)
mc2_seafood_edges_agg <- mc2_seafood_edges %>%  
  group_by(source, target,arrivaldate) %>% 
  summarise(weight=n(),
            sum_goods_weightkg = sum(weightkg),
            hscode=first(hscode)) %>% 
  #filter(Weight >=8) %>% 
  ungroup()

Let us wrangle the date columns to prepare dataframe for temporal analysis later.

(1) change the arrivaldate column to date data type

Show the code
mc2_seafood_edges_agg$arrivaldate <- as.Date(mc2_seafood_edges_agg$arrivaldate)

(2) create year, month, weekday, weeknumber columns

Show the code
mc2_seafood_edges_agg <- mc2_seafood_edges_agg %>% 
  mutate(year = year(arrivaldate)) %>% 
  mutate(month = month(arrivaldate)) %>% 
  mutate(day = day(arrivaldate)) %>% 
  mutate(weekday = wday(arrivaldate,
                        label= TRUE,
                        abbr = FALSE)) %>% 
  mutate(weeknumber = isoweek(arrivaldate))

TBC: Inspect the frequency of source and target actors, and remove those actors below a frequency count of 5.

TBC: First , we remove low frequency source actors under 5 counts.

Next, remove target actors with frequency count less than 5:

2.5 Preparation of Nodes

We will include only nodes that are in ‘source’ and ‘target’ columns in the mc2_seafood_edges_agg dataframe after the first round of data filtering.

Show the code
nodes_seafood <- MC2_nodes %>%
  filter (id %in% c(mc2_seafood_edges_agg$source, mc2_seafood_edges_agg$target))

2.6 Creating the network graph dataframe using tbl_graph() of the tidygraph package.

Note

Node file needs to have ID of nodes as first column.

Edge file need to contain source and target as column 1 and 2.

To create the network graph dataframe

Show the code
seafood_graph<- tbl_graph(nodes=nodes_seafood,
                          edges = mc2_seafood_edges_agg,
                          directed = TRUE)

The dataframe ‘seafood_graph’ has 11539 nodes with 374709 edges. It is a directed graph with 214 components.

# A tbl_graph: 11539 nodes and 374709 edges
#
# A directed multigraph with 214 components
#
# A tibble: 11,539 × 3
  id                                shpcountry rcvcountry
  <chr>                             <chr>      <chr>     
1 AquaDelight Inc and Son's         Polarinda  Oceanus   
2 Yu gan  Sea spray GmbH Industrial Oceanus    Oceanus   
3 Olas del Mar Worldwide            Oceanus    Oceanus   
4 French Crab S.p.A. Worldwide      Kondanovia Utoporiana
5 Panope Limited Liability Company  Vesperanda Oceanus   
6 hǎi dǎn Corporation Wharf         Marebak    Oceanus   
# ℹ 11,533 more rows
#
# A tibble: 374,709 × 11
   from    to arrivaldate weight sum_goods_weightkg hscode  year month   day
  <int> <int> <date>       <int>              <int> <chr>  <dbl> <dbl> <int>
1  1632  5401 2029-12-21       2              42950 307590  2029    12    21
2  1632  5401 2030-01-19       2              42585 307590  2030     1    19
3  1632  5401 2030-06-25       1              21110 307590  2030     6    25
# ℹ 374,706 more rows
# ℹ 2 more variables: weekday <ord>, weeknumber <dbl>

Running the code chunk below to confirm that ‘seafood_graph’ is not a connected graph.

Show the code
is.connected(seafood_graph)
[1] FALSE

2.7 Calculate the various centrality measures of seafood_graph.

The top 10 nodes with reference to various centrality scores are printed using kable() function from knitr.

Reference was made from this link. The tidyverse centrality functions can be taken from here.

First compute ‘betweenness’, ‘in-deg’ and ‘out-deg’ and ‘pagerank’ scores. All my betweenness scores were zero (Investigation in progress)…

Show the code
seafood_graph<- seafood_graph %>%
  activate("nodes") %>% 
  mutate(betweenness_centrality = centrality_betweenness(directed = TRUE)) %>% 
  mutate(in_deg_centrality = centrality_degree(weights = weight, 
                                               mode = "in")) %>% 
  mutate(out_deg_centrality = centrality_degree(weights = weight, 
                                               mode = "out")) %>% 
  mutate(pagerank = centrality_pagerank(weights = weight,
                                        directed = TRUE)) #%>% 
  #mutate(community = as.factor(group_edge_betweenness(weights = Weight, 
                                                      #directed = TRUE,
                                                      #n = 10)))

Let us take a look at the top 10 nodes with high ‘betweenness’ centrality scores:

To see the top 20 nodes with ‘out-deg’ scores:

3 Interactive graphs of network structure, temporal analysis and business patterns detection

3.1 Interactive graph for top 3 BETWEENNESS CENTRALITY companies

We will attempt to plot a network involving the top 3 companies including ‘Drakensberg S.A. de C.V. Marine ecology’,‘Playa de Oro BV’ and ‘bái suō wěn lú S.p.A.’ in terms of high betweenness centrality scores. We might be able to see their relationship through the graph.

Getting a list of top 3 companies in terms of betweeness scores.

Show the code
n <- 3
top_n_bet_list <- seafood_graph %>% 
  activate("nodes") %>% 
  as_tibble() %>% 
  arrange(desc(betweenness_centrality)) %>% 
  top_n(n,wt = betweenness_centrality) %>%
  pull(id)

Filter all records in the edge file involving the top 3 companies. There are 586 records after this step.

Show the code
mc2_seafood_edges_agg_bet <- mc2_seafood_edges_agg %>%
  filter(target %in% top_n_bet_list | source %in% top_n_bet_list)

The code chunk below first extracts the node dataframe from tbl_graph() object ‘seafood_graph’ created earlier. The reason for doing so is because we want the centrality values inside for tooltip later.

Next, trim this node dataframe by including nodes found only in the ‘from’ and ‘target’ columns in ‘mc2_seafood_edges_agg_bet’ edge file. There are 96 nodes after this step.

Show the code
nodes_seafood_bet <- as.data.frame(seafood_graph %>% activate(nodes))

nodes_seafood_bet <- nodes_seafood_bet %>%
  filter (id %in% c(mc2_seafood_edges_agg_bet$source, mc2_seafood_edges_agg_bet$target))

Calculate the total goods shipped between each pair of companies (for edge tooltip later)

Show the code
mc2_seafood_edges_agg_bet <- mc2_seafood_edges_agg_bet %>%
  group_by(source, target) %>%
  mutate(total_shipped_weightkg = sum(sum_goods_weightkg))

Rename the first two columns in the edge file to from and to for visNetwork to be able to recognize them. Create a title column for visNetwork to display the tooltip when we hover our mouse over the edges later.

Show the code
mc2_seafood_edges_agg_vis_bet <- mc2_seafood_edges_agg_bet %>% 
  rename(from = source) %>% 
  rename(to = target) %>% 
  mutate(title = paste('Total shipment weight = ',total_shipped_weightkg, "\n HSCODE =", hscode))

Similarly, in the nodes file we add a column title We will be displaying the ‘in-deg’, ‘betweenness’ and ‘out-deg’ scores in the tooltip. If we want to colour the nodes by their shipping countries, then we would have to rename the shpcountry column to group because visNetwork looks for group column to colour the nodes. However, we will not do this now.

Show the code
# further processing
nodes_seafood_vis_bet <- nodes_seafood_bet %>% 
  #rename(group= rcvcountry)  %>% 
  mutate(pagerank = round(pagerank, 5)) %>% 
  mutate(title = paste('shpcountry =', shpcountry, ',',
                       'rcvcountry =', rcvcountry, ',',
                       '\n In-deg = ',in_deg_centrality, ',',
                       "\n Betweenness =", betweenness_centrality, ',',
                       "\n Out-deg =", out_deg_centrality))

As we want the nodes to be colour-coded by the betweenness centrality scores, we have first have to bin the betweenness scores into intervals using the cut() function. Next, rename the bet_grp column to group column for VisNetwork to colour nodes by betweenness intervals.

Show the code
bet_brks <- c(0, 500000, 1000000, 1500000)
grps <- c('500,000 & Below','500,001 -1,000,000', '1,000,000 - 1,254,681')

nodes_seafood_vis_bet$bet_grp <- cut(nodes_seafood_vis_bet$betweenness_centrality, breaks=bet_brks, labels = grps,include.lowest = TRUE)

#nodes_seafood_vis_in$in_deg_grp <- factor(nodes_seafood_vis$in_deg_grp, ordered = TRUE, levels = c('3001-6132','2001-3000','1001-2000','501-1000','500 & Below'))

nodes_seafood_vis_bet <- nodes_seafood_vis_bet %>% 
  rename(group = bet_grp)

The code chunk below plots the interactive network graph.

Show the code
set.seed(1234)
visNetwork(nodes_seafood_vis_bet,
           mc2_seafood_edges_agg_vis_bet,
           main = "Ego Network of top 3 Betweenness companies",
           height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_nicely") %>%
  visEdges(arrows = 'to',
           smooth = list(enables = TRUE,
                         type= 'straightCross'),
           shadow = FALSE,
           dash = FALSE) %>% 
  visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visInteraction(hideEdgesOnDrag = TRUE) %>% 
  visLegend() %>%
  visLayout(randomSeed = 123)
Note

Interactive features of the graph above

(1) Select Id dropdown list

(2) Select Group dropdown list: The values inside refers to the range of ‘in-deg’ centrality scores of the nodes. The pink colour node will represent the highest in-deg centrality score, followed by green, yellow, red and blue.

(3) Zoom in to see the node labels, and arrows direction.

(4) Drag a particular node away from the cluster to admire it.

(5) Hover mouse over a node will display tooltip (shpcountry, rcvcountry, In-deg, pagerank and out-deg score). It will also display the ‘ego’ network with itself at the ego. Click on the node to freeze the ego network. Click on blank space to reset.

(6) Hovering the mouse over an edge will display tooltip (Total weight of cargo, hscode of cargo)

(7) Click and Drag on the graph to move the canvas around, will also temporary disable the edge lines.

The graph above shows the egonetwork of the top 3 betweenness companies and we can even see their tradiing relationship. The nodes in red are the companies with the top 2 betweenness centrality values, the nodes in yellow are ranked lower while the nodes in blue have relatively much lower centrality scores. The table below shows the top 10 companies in terms of centrality values.

Show the code
seafood_graph %>% 
  activate("nodes") %>% 
  as_tibble() %>% 
  arrange(desc(betweenness_centrality)) %>% 
  select(id,betweenness_centrality, out_deg_centrality,in_deg_centrality) %>% 
  head(n=10) %>%
  kable()
id betweenness_centrality out_deg_centrality in_deg_centrality
Drakensberg S.A. de C.V. Marine ecology 1254680.7 18 133
Playa de Oro BV 1231661.8 605 15
bái suō wěn lú S.p.A. 839665.4 42 95
Punjab s Marine conservation 804016.8 181 2716
David Ltd. Liability Co Forwading 640974.3 70 4473
Yenisei Eel GmbH & Co. KG Services 599808.9 1871 500
Costa de Oro S.p.A. 563721.1 113 175
Marine Mates NV Worldwide 560574.1 424 874
Liumbwe GmbH & Co. KG 551504.4 130 3
Turkish Salmon A/S Marine 470845.1 699 1132

It is observed that the companies with high betweenness scores served as both ‘source’ and ‘target’ companies; meaning they buy and sell seafood related products. Companies like ‘Drakensberg S.A. de C.V. Marine ecology’ focus more on buying than selling, while ‘Playa de Oro BV’ focuses more on selling than buying.

The business relationship observed: ‘Drakensberg S.A. de C.V. Marine ecology’ (ranked 1st) is one of the suppliers to ‘Playa de Oro BV’ (ranked 2nd) who in turns supplies to ‘bái suō wěn lú S.p.A.’ (ranked 3rd) who sells to many other companies. To reveal more relationships, the graph can be extended (to see the egonetwork of more yellow nodes ) by increasing the top n companies in the first code chunk in this section. For example, if we increase the size of the our network graph to ego network of top 5 betweenness companies (instead of 3), we would get

3.2 Interactive graph for top 5 IN_DEG CENTRALITY companies

In this section, we will focus on companies with top in-deg centrality scores.

First, obtain the list of top 5 in deg companies

Show the code
n <- 5
top_n_in_list <- seafood_graph %>% 
  activate("nodes") %>% 
  as_tibble() %>% 
  arrange(desc(in_deg_centrality)) %>% 
  top_n(n,wt = in_deg_centrality) %>%
  pull(id)
 
 #kable(top_20_in_list, col.names=c('Top 20 in-deg companies'))

Filter all records in the edge file involving the top 5 in-deg companies.

Next calculate the sum of all shipments between each company and each of it supplier company.

To further reduce the number of nodes, we will keep the records when the ‘total_shipped_weightkg’ is above 10,000,000 kg. This will retain only trading interactions involving a sizable weight (kg) of goods traded. This is an alternative to filtering using the ‘weight’ column which indicates the number of trading interactions. There are 19k edges after this step.

Show the code
# Subset the dataframe to keep only rows with valid values (top 20 in-deg) in 'target' column
mc2_seafood_edges_agg_in <- mc2_seafood_edges_agg[mc2_seafood_edges_agg$target %in% top_n_in_list, ]

# Calculate the sum of all the shipment weight between each pair of companies in edge file.
mc2_seafood_edges_agg_in <- mc2_seafood_edges_agg_in %>%
  group_by(source, target) %>%
  mutate(total_shipped_weightkg = sum(sum_goods_weightkg))

# further removal of records where 'weight' column is less than  10
mc2_seafood_edges_agg_in <- mc2_seafood_edges_agg_in %>% 
  filter(total_shipped_weightkg>=10000000)

The code chunk below first extracts the node dataframe from tbl_graph() object ‘seafood_graph’ created earlier. The reason for doing so is because we want the centrality values inside for tooltip later.

Next, trim this node dataframe by including nodes found only in the ‘from’ and ‘target’ columns in ‘mc2_seafood_edges_agg_in’ edge file. There are 62 nodes after this step.

Show the code
nodes_seafood_in <- as.data.frame(seafood_graph %>% activate(nodes))

nodes_seafood_in <- nodes_seafood_in %>%
  filter (id %in% c(mc2_seafood_edges_agg_in$source, mc2_seafood_edges_agg_in$target))

Rename the first two columns of the edge file to from and to for visNetwork to be able to recognise them. Create a title column for visNetwork to display the tooltip when we hover our mouse over the edges later.

Show the code
mc2_seafood_edges_agg_vis_in <- mc2_seafood_edges_agg_in %>% 
  rename(from = source) %>% 
  rename(to = target) %>% 
  mutate(title = paste('Total shipment weight = ',total_shipped_weightkg, "\n HSCODE =", hscode))

Similarly, in the nodes file we add a column title We will be displaying the ‘in-deg’, ‘betweenness’ and ‘out-deg’ scores in the tooltip.

Show the code
# extract nodes file from seafood_graph as a data frame
#nodes_seafood_vis <- as.data.frame(seafood_graph %>% activate(nodes))

# further processing
nodes_seafood_vis_in <- nodes_seafood_in %>% 
  #rename(group= rcvcountry)  %>% 
  mutate(pagerank = round(pagerank, 5)) %>% 
  mutate(title = paste('shpcountry =', shpcountry, ',',
                       'rcvcountry =', rcvcountry, ',',
                       '\n In-deg = ',in_deg_centrality, ',',
                       "\n Betweenness =", betweenness_centrality, ',',
                       "\n Out-deg =", out_deg_centrality))

As we want the nodes to be colour-coded by the in-deg centrality scores, we have first have to bin the in-deg scores into intervals using the cut() function. Next, rename the in_deg_grp column to group column for VisNetwork to colour nodes by in-deg score intervals.

Show the code
in_deg_brks <- c(0, 10000, 20000,30000,40000,50000, 60000,70000)
grps <- c('10,000 & Below','10,001-20,000', '20,001-30,000', '30,001-40,000','40,001-50,000', '50,001-60,000','60,001-70,000')

nodes_seafood_vis_in$in_deg_grp <- cut(nodes_seafood_vis_in$in_deg_centrality, breaks=in_deg_brks, labels = grps,include.lowest = TRUE)

#nodes_seafood_vis_in$in_deg_grp <- factor(nodes_seafood_vis$in_deg_grp, ordered = TRUE, levels = c('3001-6132','2001-3000','1001-2000','501-1000','500 & Below'))

nodes_seafood_vis_in <- nodes_seafood_vis_in %>% 
  rename(group = in_deg_grp)

The code chunk below plots in interactive network graph using visNetwork.

Show the code
set.seed(1234)
visNetwork(nodes_seafood_vis_in,
           mc2_seafood_edges_agg_vis_in,
           main = "Ego Network of top 5 IN-DEG companies",
           height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_nicely") %>%
  visEdges(arrows = 'to',
           smooth = list(enables = TRUE,
                         type= 'straightCross'),
           shadow = FALSE,
           dash = FALSE) %>% 
  visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visInteraction(hideEdgesOnDrag = TRUE) %>% 
  visLegend() %>%
  visLayout(randomSeed = 123)

The plot above shows not only the filtered ego network of the top 5 companies (in terms of in-deg scores) but also how these five companies are related to one another. Red node has the highest in-deg centrality scores , followed by blue and yellow. The top 5 in-deg companies are all related and buys from ‘Sea Breezes S.A de C.V Freight’.

The table below shows the names of the top 5 in-deg companies.

Show the code
seafood_graph %>% 
  activate("nodes") %>% 
  as_tibble() %>% 
  arrange(desc(in_deg_centrality), desc(pagerank)) %>% 
  select(id,in_deg_centrality,pagerank, out_deg_centrality) %>% 
  head(n=5) %>% 
  kable()
id in_deg_centrality pagerank out_deg_centrality
Mar del Este CJSC 63332 0.0278945 32
hǎi dǎn Corporation Wharf 57221 0.0355972 16
Caracola del Sol Services 50655 0.0167972 6
Pao gan SE Seal 35156 0.0124541 5
Costa de la Felicidad Shipping 30701 0.0094311 6

3.2.1 Temporal analysis of top 20 ‘in-deg’ centrality companies in trading occurrence over the years

First, obtain the list of company ids in the top 20 in-deg ranking.

Show the code
n <- 20
top_n_in_list <- seafood_graph %>% 
  activate("nodes") %>% 
  as_tibble() %>% 
  arrange(desc(in_deg_centrality)) %>% 
  top_n(n,wt = in_deg_centrality) %>%
  pull(id)

Next, filter the edge file to keep only the records involving the top 20 in-deg companies. There are 185K records after this step.

Show the code
# Subset the dataframe to keep only rows with valid values (top 20 in-deg) in 'target' column
mc2_seafood_edges_agg_in_hm <- mc2_seafood_edges_agg[mc2_seafood_edges_agg$target %in% top_n_in_list, ]

Use the first code chunk below to compute the sum of monthly weights (interactons) between each unique pair of company in our filtered data. The second code chunk is to compute the year-month in which a source company has the max number of shipping interaction throughout the years.

Show the code
hm <- mc2_seafood_edges_agg_in_hm %>%
  mutate(month = floor_date(arrivaldate, unit = "month")) %>%
  group_by(target, month) %>%
  summarise(total_weight_by_month = sum(weight)) %>% 
  ungroup() 

hm <- hm  %>%  group_by(target) %>%
  mutate(maxweightmonth = as.Date(month[which.max(total_weight_by_month)])) %>%
  ungroup()

Now lets plot heatmap

Show the code
plotfrom <- "2028-01-02"
plotto <- '2034-01-02'

ggplot(hm, aes(x = month, y = fct_reorder(target, maxweightmonth), fill = total_weight_by_month)) +
  geom_tile(colour="White", show.legend=FALSE) +
  scale_fill_distiller(palette="Spectral") +
  scale_y_discrete(name="", expand=c(0,0))+
  scale_x_date(name="Arrival Date", 
               limits=as.Date(c(plotfrom, plotto)), 
               expand=c(0,0),date_breaks = "1 year", 
               date_labels = "%Y") +
  labs(title="Heatmap of shipping interactions",
       subtitle=paste0("Top receiving companies from ", 
                       plotfrom, ' to ', plotto)) +
  theme_classic() +
  
  theme(axis.line.y=element_blank(), plot.subtitle=element_text(size=rel(0.78)),
        plot.title.position="plot",
        axis.text.y=element_text(colour="Black",size=5), 
        plot.title=element_text(size=rel(2.3)))

Analysis of the plot above:

The heatmap above tells us the temporal trading interactions of the top 20 in-deg companies through the years. It is observed that company ‘Mar del Este CJSC’ has ‘hot’ patterns in the most recent years.

3.2.2 Comparing trading patterns of Mar del Este CJSC company and Pao gan SE Seal company with their suppliers

Let us examine the trading patterns of ‘Mar del Este’, the top leading companies in terms of ‘in-deg’ with four out of its many suppliers (as seen from the interactive graph).

First, filter records with only Mar del Este and 4 of its suppliers.

Show the code
mar_in_df<-mc2_seafood_edges_agg_vis_in %>%
  filter(to %in% 'Mar del Este CJSC')

Next, plot a time series using geom_line() and geom_point_interactive().

Show the code
mar_in_df <-  mar_in_df %>%
  mutate(tooltip = paste0('# of interaction: ', weight, '\nDate :', arrivaldate))

 
mar_in_df1<- ggplot(mar_in_df %>%
                          filter(from=='Olas del Mar N.V.'),
                        aes(x=arrivaldate, y=weight)) +
  geom_line( color="steelblue", size = 0.8) +
  geom_point_interactive(aes(tooltip= tooltip), 
                         size = 0.5) +
  xlab("") +
  theme_light() +
  theme(#panel.spacing = unit(2, "lines"),
        axis.text.x=element_text(angle=60, hjust=1),
        panel.grid.major.x = element_blank(),
        axis.title.y = element_text(size = 10)) +
  scale_x_date(
    breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
    limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
    labels = function(x) format(x, "%b %Y")) +
  scale_y_continuous(
    limits = c(0, 30), 
    breaks = seq(0, 30, by = 5),
    expand = c(0, 0)) +
  
  labs(title='Olas del Mar N.V.', 
       x = 'Date',
       y='Number of trading occurrence')

mar_in_df2<- ggplot(mar_in_df %>% filter(from=='Sea Breezes S.A. de C.V. Freight '), aes(x=arrivaldate, y=weight)) +
  geom_line( color="steelblue", size = 0.8) + 
  geom_point_interactive(aes(tooltip= tooltip),
                         size = 0.5) +
  xlab("") +
  theme_light() +
  theme(#panel.spacing = unit(2, "lines"),
        axis.text.x=element_text(angle=60, hjust=1),
        panel.grid.major.x = element_blank(),
        axis.title.y = element_text(size = 10)) +
  scale_x_date(
    breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), 
                 by = "3 months"),
    limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
    labels = function(x) format(x, "%b %Y")) +
  scale_y_continuous(
    limits = c(0, 30),
    breaks = seq(0, 30, by = 5),
    expand = c(0, 0)) +  # Remove padding around the y-axis limits

      
  labs(title='Sea Breezes S.A. de C.V. Freight ', 
       x = 'Date',
       y='Number of trading occurrence')


mar_in_df3<- ggplot(mar_in_df %>% filter(from=='Blue Horizon Family &'), aes(x=arrivaldate, y=weight)) +
  geom_line( color="steelblue", size = 0.8) + 
  geom_point_interactive(aes(tooltip= tooltip),
                         size = 0.5) +
  xlab("") +
  theme_light() +
  theme(#panel.spacing = unit(2, "lines"),
        axis.text.x=element_text(angle=60, hjust=1),
        panel.grid.major.x = element_blank(),
        axis.title.y = element_text(size = 10)) +
  scale_x_date(
    breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
    limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
    labels = function(x) format(x, "%b %Y")) +
  scale_y_continuous(
    limits = c(0, 30), 
    breaks = seq(0, 30, by = 5),
    expand = c(0, 0)) +
  
  labs(title='Blue Horizon Family &', 
       x = 'Date',
       y='Number of trading occurrence')

mar_in_df4<- ggplot(mar_in_df %>% filter(from=='Wave Watchers Ltd. Liability Co'), aes(x=arrivaldate, y=weight)) +
  geom_line( color="steelblue", size = 0.8) + 
  geom_point_interactive(aes(tooltip= tooltip),
                         size = 0.5) +
  xlab("") +
  theme_light() +
  theme(#panel.spacing = unit(2, "lines"),
        axis.text.x=element_text(angle=60, hjust=1),
        panel.grid.major.x = element_blank(),
        axis.title.y = element_text(size = 10)) +
  scale_x_date(
    breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
    limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
    labels = function(x) format(x, "%b %Y")) +
  scale_y_continuous(
    limits = c(0, 30), 
    breaks = seq(0, 30, by = 5),
    expand = c(0, 0)) +
  
  labs(title='Wave Watchers Ltd. Liability Co', 
       x = 'Date',
       y='Number of trading occurrence')



girafe(code = print(mar_in_df1 /mar_in_df2 /mar_in_df3/mar_in_df4),
       #width_svg = 6,
       height_svg =8,
       options = list(
         opts_hover(css = "fill: #202020;"),
         opts_hover_inv(css = "opacity:0.2;")
         )
       ) 

Now let us examine the trading patterns of ‘Pao gan SE Seal’, one of the few top leading companies in terms of ‘in-deg’ with four out of its many suppliers. First filter all records involving Pao gan and its 4 suppliers.

Show the code
paogan_in_df<- mc2_seafood_edges_agg_vis_in %>%
  filter(to %in% 'Pao gan SE Seal')

Next, plot a time series using geom_line() and geom_point_interactive().

Show the code
paogan_in_df <-  paogan_in_df %>%
  mutate(tooltip = paste0('# of interaction: ', weight, '\nDate :', arrivaldate))

 
paogan_in_df1<- ggplot(paogan_in_df %>%
                          filter(from=='Paradera S.A. de C.V.'),
                        aes(x=arrivaldate, y=weight)) +
  geom_line( color="steelblue", size = 0.8) +
  geom_point_interactive(aes(tooltip= tooltip), 
                         size = 0.5) +
  xlab("") +
  theme_light() +
  theme(#panel.spacing = unit(2, "lines"),
        axis.text.x=element_text(angle=60, hjust=1),
        panel.grid.major.x = element_blank(),
        axis.title.y = element_text(size = 10)) +
  scale_x_date(
    breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
    limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
    labels = function(x) format(x, "%b %Y")) +
  scale_y_continuous(
    limits = c(0, 100), 
    breaks = seq(0, 100, by = 20),
    expand = c(0, 0)) +
  
  labs(title='Paradera S.A. de C.V.', 
       x = 'Date',
       y='Number of trading occurrence')

paogan_in_df2<- ggplot(paogan_in_df %>% filter(from=='Greek Octopus SRL Logistics'), aes(x=arrivaldate, y=weight)) +
  geom_line( color="steelblue", size = 0.8) + 
  geom_point_interactive(aes(tooltip= tooltip),
                         size = 0.5) +
  xlab("") +
  theme_light() +
  theme(#panel.spacing = unit(2, "lines"),
        axis.text.x=element_text(angle=60, hjust=1),
        panel.grid.major.x = element_blank(),
        axis.title.y = element_text(size = 10)) +
  scale_x_date(
    breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), 
                 by = "3 months"),
    limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
    labels = function(x) format(x, "%b %Y")) +
  scale_y_continuous(
    limits = c(0, 100),  # Set the y-axis limits
    breaks = seq(0, 100, by = 20),  # Set the y-axis breaks at intervals of 5
    #minor_breaks = seq(0, 30, by = 1),  # Set the y-axis minor breaks at intervals of 1
    expand = c(0, 0)) +  # Remove padding around the y-axis limits

      
  labs(title='Greek Octopus SRL Logistics', 
       x = 'Date',
       y='Number of trading occurrence')


paogan_in_df3<- ggplot(paogan_in_df %>% filter(from=='Madhya Pradesh  Market LLC'), aes(x=arrivaldate, y=weight)) +
  geom_line( color="steelblue", size = 0.8) + 
  geom_point_interactive(aes(tooltip= tooltip),
                         size = 0.5) +
  xlab("") +
  theme_light() +
  theme(#panel.spacing = unit(2, "lines"),
        axis.text.x=element_text(angle=60, hjust=1),
        panel.grid.major.x = element_blank(),
        axis.title.y = element_text(size = 10)) +
  scale_x_date(
    breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
    limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
    labels = function(x) format(x, "%b %Y")) +
  scale_y_continuous(
    limits = c(0, 100),
    breaks = seq(0, 100, by = 20), 
    expand = c(0, 0)) +
  
  labs(title='Madhya Pradesh  Market LLC', 
       x = 'Date',
       y='Number of trading occurrence')

paogan_in_df4<- ggplot(paogan_in_df %>% filter(from=='Sea Breezes S.A. de C.V. Freight '), aes(x=arrivaldate, y=weight)) +
  geom_line( color="steelblue", size = 0.8) + 
  geom_point_interactive(aes(tooltip= tooltip),
                         size = 0.5) +
  xlab("") +
  theme_light() +
  theme(#panel.spacing = unit(2, "lines"),
        axis.text.x=element_text(angle=60, hjust=1),
        panel.grid.major.x = element_blank(),
        axis.title.y = element_text(size = 10)) +
  scale_x_date(
    breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
    limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
    labels = function(x) format(x, "%b %Y")) +
  scale_y_continuous(
    limits = c(0, 100), 
    breaks = seq(0, 100, by = 20),  
    expand = c(0, 0)) +
  labs(title='Sea Breezes S.A. de C.V. Freight ', 
       x = 'Date',
       y='Number of trading occurrence')

girafe(code = print(paogan_in_df1 /paogan_in_df2 /paogan_in_df3 /paogan_in_df4),
       #width_svg = 6,
       height_svg =8,
       options = list(
         opts_hover(css = "fill: #202020;"),
         opts_hover_inv(css = "opacity:0.2;")
         )
       ) 

Analysis of the plot above: The trading interaction of ‘Mar del Este’ with its suppliers is more consistent throughout as compared to ‘Pao gan SE Seal’. Throughout the years, ‘Pao gan SE Seal’ has interactions many suppliers but some of these relationships are short term. . It dawned on me that we can identify IUU companies that frequently close down and re-register their companies from the type of plot above. For such companies, I would expect the buyer’s graph to show that buyer has been changing suppliers very frequently.

For our group project we can consider creating drop down list of ‘buying companies’ for user to interact with and see each of the buyer’s interaction with their top n suppliers over the years. A coordinated view can be created with social network graph such that when a buyer / supplier node is selected on the network graph, its trading activity over time with top n suppliers/ buyers is automatically generated.

3.3 Interactive graph for top 5 OUT-DEGREE CENTRALITY companies

Let us get the top 5 company names in terms of out-deg centrality scores.

Show the code
n <- 5
top_n_out_list <- seafood_graph %>% 
  activate("nodes") %>% 
  as_tibble() %>% 
  arrange(desc(out_deg_centrality)) %>% 
  top_n(n, wt = out_deg_centrality) %>%
  pull(id)
#kable(top_20_out_list, col.names=c('Top 20 out-deg companies'))

Filter all records in the edge file involving the top 5 out-deg companies.

Next calculate the sum of all shipments between each company and each of it buyercompany.

To further reduce the number of nodes, we will keep the records when the ‘total_shipped_weightkg’ is above 10,000,000 kg. This will retain only trading interactions involving a sizable weight (kg) of goods traded. This is an alternative to filtering using the ‘weight’ column which indicates the number of trading interactions. There are 10k edges after this step.

Show the code
# Subset the dataframe to keep only rows with valid values (top 5 out-deg) in 'from' column (15K rows)
mc2_seafood_edges_agg_out <- mc2_seafood_edges_agg[mc2_seafood_edges_agg$source %in% top_n_out_list, ]

# Calculate the sum of all the shipment weight between each pair of companies in edge file.
mc2_seafood_edges_agg_out <- mc2_seafood_edges_agg_out %>%
  group_by(source, target) %>%
  mutate(total_shipped_weightkg = sum(sum_goods_weightkg))

# further removal of records where 'weight' column is less than  10
mc2_seafood_edges_agg_out <- mc2_seafood_edges_agg_out %>% 
  filter(total_shipped_weightkg>=10000000)

The first code chunk below extracts the ‘nodes’ table from tbl_graph() object ‘seafood_graph’ created earlier. The reason for doing so is because we want the centrality values inside for tooltip later.

Next, trim this node dataframe by including nodes found only in the ‘from’ and ‘target’ columns in ‘mc2_seafood_edges_agg_in’ edge file. There are 24 nodes after this step.

Show the code
nodes_seafood_out <- as.data.frame(seafood_graph %>% activate(nodes))

nodes_seafood_out <- nodes_seafood_out %>%
  filter (id %in% c(mc2_seafood_edges_agg_out$source, mc2_seafood_edges_agg_out$target))

Rename the first two columns of the edge file to from and to for visNetwork to be able to recognise them. Create a title column for visNetwork to display the tooltip when we hover our mouse over the edges later.

Show the code
mc2_seafood_edges_agg_vis_out <- mc2_seafood_edges_agg_out %>% 
  rename(from = source) %>% 
  rename(to = target) %>% 
  mutate(title = paste('Total shipment weight = '
                       ,total_shipped_weightkg,
                       "\n HSCODE =",
                       hscode))

Similarly, in the nodes file we add a column title We will be displaying the ‘in-deg’, ‘betweenness’ and ‘out-deg’ scores in the tooltip.

Show the code
nodes_seafood_vis_out <- nodes_seafood_out %>% 
  #rename(group= rcvcountry)  %>% 
  mutate(pagerank = round(pagerank, 5)) %>% 
  mutate(title = paste('shpcountry =', shpcountry, ',',
                       'rcvcountry =', rcvcountry, ',',
                       '\n In-deg = ',in_deg_centrality, ',',
                       "\n Betweenness =", betweenness_centrality, ',',
                       "\n Out-deg =", out_deg_centrality))

As we want the nodes to be colour-coded by the out-deg centrality scores, we have first have to bin the out-deg scores into intervals using the cut() function. Next, rename the out_deg_grp column to group column for VisNetwork to colour nodes by out-deg score intervals.

Show the code
out_deg_brks <- c(0,10000, 20000, 30000)
grps <- c('10,000 & Below','10,001-20,001', '20,001-27,570')

nodes_seafood_vis_out$out_deg_grp <- cut(nodes_seafood_vis_out$out_deg_centrality, breaks=out_deg_brks, labels = grps,include.lowest = TRUE)


nodes_seafood_vis_out <- nodes_seafood_vis_out %>% 
  rename(group = out_deg_grp)

The code chunk below plots the interactive graph.

Show the code
visNetwork(nodes_seafood_vis_out,
           mc2_seafood_edges_agg_vis_out,
           main = "Ego Network of top 5 OUT-DEG companies",
           height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_nicely") %>%
  visEdges(arrows = 'to',
           smooth = list(enables = TRUE,
                         type= 'straightCross'),
           shadow = FALSE,
           dash = FALSE) %>% 
  visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visInteraction(hideEdgesOnDrag = TRUE) %>% 
  visLegend() %>%
  visLayout(randomSeed = 123)

The chart highlights actors with high out-deg centrality scores over 2028 to 2034. In this default view, we can quickly identify the 2 coloured nodes as top 2 suppliers. Both top out-deg companies supplies to 3 common target companies; they are “Coasta de la Felicidad’, ‘Niger Bend Limited Liability..’ and”Pao gan SE Seal’.

The table below shows the names of the top 5 out-deg companies

Show the code
seafood_graph %>% 
  activate("nodes") %>% 
  as_tibble() %>% 
  arrange(desc(out_deg_centrality)) %>% 
  select(id,out_deg_centrality,in_deg_centrality,pagerank) %>% 
  head(n=5) %>% 
  kable()
id out_deg_centrality in_deg_centrality pagerank
nián yú Ltd. Corporation 27570 2 8.55e-05
Sea Breezes S.A. de C.V. Freight 24283 10 5.29e-05
Estrella de la Costa SRL 8862 7 8.12e-05
Blue Horizon Family & 7670 0 4.16e-05
Madhya Pradesh Market LLC 6789 1 4.27e-05

3.3.1 Temporal analysis of top 20 ‘out-deg’ centrality companies in trading occurrence over the years

First, obtain the list of company ids in the top 20 in-deg ranking.

Show the code
n <- 20
top_n_out_list <- seafood_graph %>% 
  activate("nodes") %>% 
  as_tibble() %>% 
  arrange(desc(out_deg_centrality)) %>% 
  top_n(n, wt = out_deg_centrality) %>%
  pull(id)

Next, filter the edge file to keep only the records involving the top 20 out-deg companies. There are 43K records after this step

Show the code
# Subset the dataframe to keep only rows with valid values (top 20 out-deg) in 'target' column
mc2_seafood_edges_agg_out_hm <- mc2_seafood_edges_agg[mc2_seafood_edges_agg$source %in% top_n_out_list, ]

Use the first code chunk below to compute the sum of monthly weights (interactions) between each unique pair of company in our filtered data. The second code chunk is to compute the year-month in which a source company has the max number of shipping interaction throughout the years.

Show the code
hm <- mc2_seafood_edges_agg_out_hm %>%
  mutate(month = floor_date(arrivaldate, unit = "month")) %>%
  group_by(source, month) %>%
  summarise(total_weight_by_month = sum(weight)) %>% 
  ungroup() 

hm <- hm  %>%  group_by(source) %>%
  mutate(maxweightmonth = as.Date(month[which.max(total_weight_by_month)])) %>%
  ungroup()

We will now plot the heatmap of top 20 out-deg companies to show their shipping interactions over the years.

Show the code
plotfrom <- "2028-01-02"
plotto <- '2034-01-02'

ggplot(hm, aes(x = month, y = fct_reorder(source, maxweightmonth), fill = total_weight_by_month)) +
  geom_tile(colour="White", show.legend=FALSE) +
  scale_fill_distiller(palette="Spectral") +
  scale_y_discrete(name="", expand=c(0,0))+
  scale_x_date(name="Arrival Date", 
               limits=as.Date(c(plotfrom, plotto)), 
               expand=c(0,0),date_breaks = "1 year", 
               date_labels = "%Y") +
  labs(title="Heatmap of shipping interactions",
       subtitle=paste0("Top supplier companies from ", 
                       plotfrom, ' to ', plotto)) +
  theme_classic() +
  
  theme(axis.line.y=element_blank(), plot.subtitle=element_text(size=rel(0.78)),
        plot.title.position="plot",
        axis.text.y=element_text(colour="Black",size=5), 
        plot.title=element_text(size=rel(2.3)))

Analysis of the plot above:

At a quick glance, three companies are hot, they are ‘nian yu Ltd Corporation’ ,‘Playa del Tesoro OJSC’. and ‘Sea Breezes SA’.

‘nian yu Ltd Corporation’ had several instances of very high shipping activities in 2031 and 2032, which is worth looking into.

‘Madhya Pradesh Market LLC’ had interactions only from 2029 to 2032 first quarter. What happened to it?

3.3.2 Treemap of business relationship between shipping and receiving companies

Finally, lets build a treemap to get a high level view of the companies that these main out-deg companies ship the goods to.

The code chunk below groups the data by from and to columns and aggregating the TotalInteractions and MedianCargoWeight_daily between each pair shipping and receiving company by hscode.

Show the code
seafood_tree <-mc2_seafood_edges_agg_out_hm %>% 
  group_by(source,target) %>% 
  summarise(TotalInteractions=sum(weight),
            Totalshipment= sum(sum_goods_weightkg)) %>% 
  ungroup() %>% 
  arrange(desc(TotalInteractions))

The code chunk below plots the treemap using the treemap library.

Show the code
tm<- treemap(seafood_tree,
        index=c("source", "target"),
        vSize="TotalInteractions",
        vColor="Totalshipment",
        type="value",
        palette="RdYlBu", 
        algorithm = "squarified",
        title='Shipping and receiving companies interaction pattern',
        title.legend = "Total shipment weight"
        )

The most outer layer refers to shipping companies while the tiles within represents the companies that they shipped their goods to. The bigger the tile size, the more the interaction. The darker the colour, the greater the total shipment weight accumulated across the years.

The largest player here is ‘nian yu Ltd Corporation’ .

‘Playa del Tesoro OJSC’ is the ranked fifth in terms of out-deg centrailty and it ships in high frequency and large volume to its partners over the years 2028 to 2034.

The interactive version created with d3treeR library. But since this is not available in CRAN, we might not be able to use this in Shiny.

Show the code
library(d3treeR)
Show the code
d3tree(tm,rootname = "Supplier Companies" )

3.4 Correlationship between partnership intervals (in days) and total number of interactions between company pairs

In this section, lets examine whether there is a correlationship between the interval of interaction within a year and the number of interactions between each pairs of companies. Will there be companies with short partnership duration and high number of interactions (e.g. shipping in high frequency for only 1 week within a year) ?

We will prepare the dataframe needed for the plot.

First, for each group of from, to, year:

1) partnership_days : the number of days within the year that each pair of companies had interactions

2) total_interaction : sum of all the Weights between each pair of companies

Show the code
cor <- mc2_seafood_edges_agg_vis_in %>% 
  group_by(from, to,year) %>% 
  summarise(partnership_days=as.integer(max(arrivaldate)-min(arrivaldate)+1),
            total_interaction = sum(weight),
            median_interaction = median(weight)) %>% 
  ungroup() %>% 
  arrange(partnership_days)

3.4.1 Checking for statistical significance in correlationship

Show the code
library(scales)
correl_2028 <- ggscatterstats(data = cor %>% filter(year==2028), 
                           x = partnership_days, y = total_interaction,
                           type = "nonparametric") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  scale_x_continuous(limits = c(0, 365), 
                     breaks=seq(0, 365, 30), 
                     labels= comma) + 
  scale_y_continuous(limits = c(0, 1250), 
                     breaks=seq(0, 1250, 250), 
                     labels= comma) +
  
  labs(title = "Number of interactions and partnership duration in 2028", 
       x = "Partnership days", y = "Total interactions") 

correl_2029 <- ggscatterstats(data = cor %>% filter(year==2029), 
                           x = partnership_days, y = total_interaction,
                           type = "nonparametric") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  scale_x_continuous(limits = c(0, 365), 
                     breaks=seq(0, 365, 30), 
                     labels= comma) + 
  scale_y_continuous(limits = c(0, 1250), 
                     breaks=seq(0, 1250, 250), 
                     labels= comma) +
  
  labs(title = "Number of interactions and partnership duration in 2029", 
       x = "Partnership days", y = "Total interactions") 


correl_2030 <- ggscatterstats(data = cor %>% filter(year==2030), 
                           x = partnership_days, y = total_interaction,
                           type = "nonparametric") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  scale_x_continuous(limits = c(0, 365), 
                     breaks=seq(0, 365, 30), 
                     labels= comma) + 
  scale_y_continuous(limits = c(0, 1250), 
                     breaks=seq(0, 1250, 250), 
                     labels= comma) +
  
  labs(title = "Number of interactions and partnership duration in 2030", 
       x = "Partnership days", y = "Total interactions") 


correl_2031 <- ggscatterstats(data = cor %>% filter(year==2031), 
                           x = partnership_days, y = total_interaction,
                           type = "nonparametric") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  scale_x_continuous(limits = c(0, 365), 
                     breaks=seq(0, 365, 30), 
                     labels= comma) + 
  scale_y_continuous(limits = c(0, 1250), 
                     breaks=seq(0, 1250, 250), 
                     labels= comma) +
  
  labs(title = "Number of interactions and partnership duration in 2031", 
       x = "Partnership days", y = "Total interactions") 


correl_2032 <- ggscatterstats(data = cor %>% filter(year==2032), 
                           x = partnership_days, y = total_interaction,
                           type = "nonparametric") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  scale_x_continuous(limits = c(0, 365), 
                     breaks=seq(0, 365, 30), 
                     labels= comma) + 
  scale_y_continuous(limits = c(0, 1250), 
                     breaks=seq(0, 1250, 250), 
                     labels= comma) +
  
  labs(title = "Number of interactions and partnership duration in 2032", 
       x = "Partnership days", y = "Total interactions") 


correl_2033 <- ggscatterstats(data = cor %>% filter(year==2033), 
                           x = partnership_days, y = total_interaction,
                           type = "nonparametric") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  scale_x_continuous(limits = c(0, 365), 
                     breaks=seq(0, 365, 30), 
                     labels= comma) + 
  scale_y_continuous(limits = c(0, 1250), 
                     breaks=seq(0, 1250, 250), 
                     labels= comma) +
  
  labs(title = "Number of interactions and partnership duration in 2033", 
       x = "Partnership days", y = "Total interactions") 



correl_2034 <- ggscatterstats(data = cor %>% filter(year==2034), 
                           x = partnership_days, y = total_interaction,
                           type = "nonparametric") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  scale_x_continuous(limits = c(0, 365), 
                     breaks=seq(0, 365, 30), 
                     labels= comma) + 
  scale_y_continuous(limits = c(0, 1250), 
                     breaks=seq(0, 1250, 250), 
                     labels= comma) +
  
  labs(title = "Number of interactions and partnership duration in 2034", 
       x = "Partnership days", y = "Total interactions") 

# combining plots using patchwork
p_correl <- (correl_2034 + correl_2033) / (correl_2032 + correl_2031) / (correl_2030 + correl_2029) # + plot_spacer() + plot_spacer()
p_correl + plot_annotation(title = "Correlation between Number of interactions and partnership duration (days)", 
                           theme = theme(plot.title = element_text(size = 18),
                                         plot.subtitle = element_text(size = 12))) + plot_layout(ncol = 1, nrow = 3,
                                                                                                 heights = c(2,2))

The plots (non-parametric) above has p-values less than 0.05 and it suggests that there is a correlationship between the rank-transformed data of total interactions and partnership duration between companies . The upper outliers pairs of companies could be worth investigating because they have high number of interactions for a particular partnership duration in a year. For example, if a company A had interaction with company B for only 3 months but with exceptionally high number of trading interactions, should both companies be worth investigating?

3.4.2 Coordinated and interactive scatterplot

For usability , lets us plot an interactive and coordinate equivalent plot of the above scatterplot chart for the most recent four years 2034, 2033, 2032 and 2031.

First we will prepare the tooltip to be shown.

Show the code
cor <- cor %>%
  mutate(label1 = group_indices(., from, to)) %>% 
  mutate(fromto = paste('From: ', from,
                        'To: ', to)) %>% 
  mutate(tooltip = paste(fromto,
                         '\nTotal interaction: ', total_interaction,
                         '\nPartnership duration: ', partnership_days, 'days'))

Next, we will filter and use geom_point_interactive() of ggiraph library to plot.

Show the code
scatter_2034 <- ggplot(data=cor%>% filter(year==2034),
       aes(x=partnership_days, y=total_interaction)) +
  geom_point_interactive(aes(data_id = label1,
                             tooltip= tooltip),
                         size = 0.8) +
  geom_smooth(method='lm',
              size = 0.5) +
  scale_x_continuous(limits = c(0, 365), breaks = seq(0, 365, 30)) +
  scale_y_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 200)) +

  labs(title = "Scatterplot of # of interactions and Partnership duration in 2034",
       y = "Total interactions",
       x = "Partnership duration") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12, face ='bold'),
        axis.title = element_text(size = 8, face= 'bold'),
        axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text = element_text(size = 8))


scatter_2033 <- ggplot(data=cor%>% filter(year==2033),
       aes(x=partnership_days, y=total_interaction)) +
  geom_point_interactive(aes(data_id = label1,
                             tooltip= tooltip),
                         size = 0.8) +
  geom_smooth(method='lm',
              size = 0.5) +
  scale_x_continuous(limits = c(0, 365), breaks = seq(0, 365, 30)) +
  scale_y_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 200)) +

  labs(title = "Scatterplot of # of interactions and Partnership duration in 2033",
       y = "Total interactions",
       x = "Partnership duration") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12,face = 'bold'),
        axis.title = element_text(size = 8, face= 'bold'),
        axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text = element_text(size = 8))

scatter_2032 <- ggplot(data=cor%>% filter(year==2032),
       aes(x=partnership_days, y=total_interaction)) +
  geom_point_interactive(aes(data_id = label1,
                             tooltip= tooltip),
                         size = 0.8) +
  geom_smooth(method='lm',
              size = 0.5) +
  scale_x_continuous(limits = c(0, 365), breaks = seq(0, 365, 30)) +
  scale_y_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 200)) +

  labs(title = "Scatterplot of # of interactions and Partnership duration in 2032",
       y = "Total interactions",
       x = "Partnership duration") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12,face = 'bold'),
        axis.title = element_text(size = 8, face= 'bold'),
        axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text = element_text(size = 8))


scatter_2031 <- ggplot(data=cor%>% filter(year==2031),
       aes(x=partnership_days, y=total_interaction)) +
  geom_point_interactive(aes(data_id = label1,
                             tooltip= tooltip),
                         size = 0.8) +
  geom_smooth(method='lm',
              size = 0.5) +
  scale_x_continuous(limits = c(0, 365), breaks = seq(0, 365, 30)) +
  scale_y_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 200)) +

  labs(title = "Scatterplot of # of interactions and Partnership duration in 2031",
       y = "Total interactions",
       x = "Partnership duration") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12,face = 'bold'),
        axis.title = element_text(size = 8, face= 'bold'),
        axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text = element_text(size = 8))




girafe(code = print(scatter_2034 / scatter_2033 /scatter_2032 /scatter_2031),
       width_svg = 6,
       height_svg = 7,
       options = list(
         opts_hover(css = "fill: #202020;"),
         opts_hover_inv(css = "opacity:0.2;")
         )
       ) 
Note

Features of the chart above

  • Each point in chart represents a pair of companies that ships and receives, hover mouse over to see ‘company pair names’, ‘total interaction’ and ‘partnership days’ inside tooltip

  • If a particular company pair exists in more than one chart , its corresponding point will also be highlighted in the other charts.

If two companies consistently have similar number of interactions within similar partnership days across the years, then there is no issue. Explore the interactive chart to spot pairs of companies with ‘solo’ occurrence throughout the years and is also one of the outlier pairs. Those pair of companies could be worth investigating.

3.5 Comparison of centrality values across shipping countries

In this section, we will compare the centrality values of companies from different countries.

3.5.1 Comparison of out-deg centrality between top shipping countries

Let us compute top 5 shipping countries in terms of out-deg centrality scores.

Show the code
nodes_seafood_vis_out %>% 
  group_by(shpcountry) %>% 
  summarise(sum_out_deg = sum(out_deg_centrality)) %>% 
  arrange(desc(sum_out_deg)) %>% 
  head(n=5) %>% kable()
shpcountry sum_out_deg
Merigrad 34396
Isliandor 24283
Mawazam 8891
Osterivaria 7670
Marebak 22

The code chunk below filters the top 5 shipping countries and non parametric one way anova test is performed to compare for significance in difference in median of number of interactions between them.

Show the code
ggbetweenstats(data = nodes_seafood_vis_out %>% 
                 filter(shpcountry %in% c("Isliandor", "Osterivaria", "Vesperanda", "Mawazam", "Merigrad")),
               x = shpcountry, 
               y = out_deg_centrality,
               xlab = "Shipping Country", ylab = "Out-Deg centrality",
               type = "np", pairwise.comparisons = TRUE, pairwise.display = "s",
               sort = "descending",
               sort.fun = median,
               mean.ci = T, p.adjust.method = "fdr",  conf.level = 0.95,
               title = "Comparison of Median Out-Deg centrality across shipping Countries") +
  scale_y_continuous(limits = c(0, 4000)) +
  theme(axis.title.y=element_text(angle = 0,
                                  vjust=0.9))

The P - value is above 0.05, so there is not enough statistical evidence to reject the null hypothesis that the median out-deg centrality scores between top 5 shipping countries are different.

3.5.2 Comparison of in-deg centrality between top receiving countries

Let us compute top 5 receiving countries in terms of in-deg centrality scores.

Show the code
nodes_seafood_vis_in %>% 
  group_by(rcvcountry) %>% 
  summarise(sum_in_deg = sum(in_deg_centrality)) %>% 
  arrange(desc(sum_in_deg)) %>% 
  head(n=5)
# A tibble: 5 × 2
  rcvcountry  sum_in_deg
  <chr>            <dbl>
1 Oceanus         240470
2 Coralmarica         33
3 Marebak              4
4 Merigrad             1
5 Faraluna             0

There are only 3 major receiving countries in our filtered dataset .

Show the code
ggbetweenstats(data = nodes_seafood_vis_in %>% filter(rcvcountry %in% c("Jiraputra", "Oceanus",'Coralmarica','Marebak','Mawazam')), x = rcvcountry, y = in_deg_centrality,
               xlab = "Shipping Country", ylab = "In-Deg centrality",
               type = "np", pairwise.comparisons = TRUE, pairwise.display = "s",
               sort = "descending",
               sort.fun = median,
               mean.ci = T, p.adjust.method = "fdr",  conf.level = 0.95,
               title = "Comparison of Median In-Deg centrality across receiving Countries") +
  scale_y_continuous(limits = c(0, 150)) +
   theme(axis.title.y=element_text(angle = 0,
                                  vjust=0.9))

The P - value is above 0.05, so there is not enough statistical evidence to reject the null hypothesis that the median in-deg centrality scores between top 5 receiving countries are different.

3.6 Conclusion

In this exercise, we have identified the key centrality actors in terms of betweenness, in-degree and out-degree. My key findings are such:

  • Companies high in betweenness scores ‘buy’ and ‘sell’, although they tend to focus on one of the two activities. However, this companies usually do not have the highest in and out degree scores. We can derive simple business relationships as seen in section 3.1 and to get more relationship, we can simply extend the network graph.

  • Companies with high in-degree scores tend to focus heavily on buying only and companies with highh out-degree scores tend to focus on selling only. Companies with high in/out centrality scores can share similar suppliers/ buyers. A treemap (section 3.3.3) will be able to help us see such relationship.

  • Heatmaps allows us to see shipping activities for many companies at one go. We can look out for companies who have spikes in their selling and buying patterns to fish out illegal activities.

  • Multiple line plots (section 3.2.2) can show us the shipping patterns between a target company and some of its source companies over time. From target companies who keeps changing source companies, we could try to investigate them to check if they are dealing with companies who kept de-registering after being caught for IUU and re-registering to continue the business.

My key takeaway from this exercise:

  • I filtered records too early, even though some edge weights are small, but they could provide the link to finding high betweenness nodes. By chopping away these edges to early, I was unable to get any betweenness scores for the remaining network in my filtered dataset.

4 References

Hohenfeld, F. (2021, August 12). Graphs Are Fun: An Introduction to Graphs in R. Hohenfeld.is. Retrieved from https://hohenfeld.is/posts/graphs-are-fun-an-introduction-to-graphs-in-r/

R Core Team. (2021, September 13). ggraph: A Grammar of Graphics for Graphs and Networks. The Comprehensive R Archive Network (CRAN). Retrieved from https://cran.r-project.org/web/packages/ggraph/vignettes/Edges.html

Datastorm-Open. (n.d.). visNetwork. GitHub Pages. Retrieved from http://datastorm-open.github.io/visNetwork/

Rivasiker, G. (n.d.). ggHoriPlot: Interactive Horizon Plot for R. GitHub Pages. Retrieved from https://rivasiker.github.io/ggHoriPlot/

Designing Interactive Treemap Using D3TreeR.” (n.d.). Retrieved from https://r4va.netlify.app/chap16.html#designing-interactive-treemap-using-d3treer

Public Health England. (2020). Visualising the spread of COVID-19 in England [PDF document]. Retrieved from https://extra.shu.ac.uk/ppp-online/wp-content/uploads/2020/11/visualising-spread-COVID19-england.pdf